Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages

Authors

  • Tanel Alumäe
  • Toomas Kirt
Abstract

We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as basic units for language model adaptation. Experiments on a set of Estonian test texts and broadcast news speech data show that lemmas and morphemes give better performance than words in all cases. In speech recognition experiments, morpheme-based adaptation is found to perform significantly better than lemma-based adaptation.
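
The fast marginal adaptation step named in the abstract can be illustrated with a minimal sketch. In its usual form, each background probability P_bg(w | h) is rescaled by (P_adapt(w) / P_bg(w))^beta and the result is renormalized over the vocabulary. The sketch below computes the scaling factor per lemma, so that all inflected forms of a lemma are boosted together. Everything specific here is an assumption for illustration and not taken from the paper: the toy Estonian lemma table `LEMMA`, the function names, the value of `beta`, the background probabilities, and above all the adaptation unigram, which is estimated from plain smoothed counts rather than from an LSA model.

```python
from collections import Counter

# Hypothetical toy lemma table; a real system would use a morphological analyzer.
LEMMA = {
    "maja": "maja", "majas": "maja", "majad": "maja",
    "auto": "auto", "autos": "auto", "autod": "auto",
}


def lemma_unigram(tokens, smoothing=1.0):
    """Additively smoothed unigram distribution over lemmas.

    Stand-in for the LSA-derived unigram estimate (an assumption of this sketch).
    """
    lemmas = sorted(set(LEMMA.values()))
    counts = Counter(LEMMA[t] for t in tokens if t in LEMMA)
    total = sum(counts.values()) + smoothing * len(lemmas)
    return {lem: (counts[lem] + smoothing) / total for lem in lemmas}


def fast_marginal_adaptation(p_bg_cond, p_bg_lemma, p_adapt_lemma, beta=0.5):
    """Rescale P_bg(w | h) for one fixed history h and renormalize.

    The scaling factor is computed on the lemma of each word form.
    """
    scaled = {}
    for word, prob in p_bg_cond.items():
        lem = LEMMA[word]
        alpha = (p_adapt_lemma[lem] / p_bg_lemma[lem]) ** beta
        scaled[word] = alpha * prob
    z = sum(scaled.values())  # normalization term Z(h)
    return {word: value / z for word, value in scaled.items()}


# Toy data: a balanced background text and a short adaptation text about cars.
background_text = "maja auto majas autod majad autos".split()
adaptation_text = "autod auto autos auto".split()

p_bg_lemma = lemma_unigram(background_text)
p_adapt_lemma = lemma_unigram(adaptation_text)

# Hypothetical background probabilities P_bg(w | h) for a single history h.
p_bg_cond = {
    "maja": 0.3, "majas": 0.2, "majad": 0.1,
    "auto": 0.2, "autos": 0.1, "autod": 0.1,
}

p_adapted = fast_marginal_adaptation(p_bg_cond, p_bg_lemma, p_adapt_lemma)
```

With this toy data, every form of the lemma "auto" gains probability mass, including forms that occur only once in the adaptation text, which is the motivation for adapting on lemmas rather than on surface word forms.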

Related articles

LSA-based language model adaptation for highly inflected languages

This paper presents a language model topic adaptation framework for highly inflected languages. In such languages, subword units are used as basic units for language modeling. Since such units carry little semantic information, they are not very suitable for topic adaptation. We propose to lemmatize the corpus of training documents before constructing a latent topic model. To adapt language mod...

LSA learner sentence comprehension in agglutinative and non-agglutinative languages

This work has been carried out in the context of automatic evaluation of learner summaries, where text comprehension is gained using Latent Semantic Analysis (LSA) and Natural Language Processing (NLP) techniques. We had intuitively observed that lemmatized versions of LSA matrices resembled human Basque similarity judgements better than the non-lemmatized ones. This research was conducted to tes...

A Framework for Language Model Adaptation for Highly-Inflected Slovenian Language

This paper describes a new framework for constructing topic-adapted language models for large vocabulary speech recognition of the highly inflected Slovenian language. Two important difficulties caused by the high inflectionality of Slovenian are discussed: the out-of-vocabulary rate and feature extraction for topic detection. To use the most popular language models (N-grams) and the well-known classifiers (T...

Rapid Unsupervised Topic Adaptation – a Latent Semantic Approach

In open-domain language exploitation applications, a wide variety of topics with swift topic shifts has to be captured. Consequently, it is crucial to rapidly adapt all language components of a spoken language system. This thesis addresses unsupervised topic adaptation in both monolingual and crosslingual settings. For automatic speech recognition we rapidly adapt a language model on a source l...

Topic detection for language model adaptation of highly-inflected languages by using a fuzzy comparison function

A new framework is proposed to construct corpus-based topic-adapted language models for large vocabulary speech recognition of the highly inflected Slovenian language. The proposed techniques can be applied to other Slavic languages, where words are formed with many different inflectional affixes. This article attempts to overcome two important difficulties of highly inflected languages (hig...

Publication date: 2007